The limits of a decoupled out-of-order superscalar architecture
نویسنده
چکیده
This thesis presents a study into a technique for improving performance in outof-order superscalar architectures. It identifies three technological trends limiting superscalar performance; they are the increasing cost of a main memory access, control dependencies and the greater hardware complexity of out-of-order execution. Decoupling is a technique that can provide higher performance through the mechanism of dynamically reordering, asynchronous instruction streams. It offers the capability to improve ILP, through effective latency hiding and dynamic scheduling, and to reduce hardware complexity, through decentralised logic. This thesis evaluates this capability, by investigating the effectiveness of decoupled out-of-order superscalar architectures. This thesis identifies the degree to which operations can reorder (the degree of reordering) as the critical dimension to an out-of-order superscalar architecture. It investigates the effectiveness of decoupling by focusing on those design issues that determine the degree of reordering, and relaxes all other architectural constraints. This approach allows us to establish the limitations of decoupled out-of-order superscalar architectures. This thesis shows that a decoupled architecture, through its dynamically reordering instructions windows, provides a possible solution to the problem of latency hiding and issue logic complexity. This thesis demonstrate that for large memory latencies, a decoupled architecture with 2 instruction streams is less sensitive to increases in memory latency than a conventional single stream superscalar architecture. The results also show that for memory latencies greater than 20 cycles, a decoupled architecture can achieve a higher speedup than a conventional superscalar architecture with twice the individual window sizes of a decoupled unit. An explanation for this effect is provided through the concept of the Effective Window Size. The thesis also investigates a 3-stream decoupled superscalar architecture, that provides dedicated hardware support for resolving control dependencies. The results show that for the partitioning algorithm used in this thesis, the load balancing is poor and the extra hardware resources are under utilised. For this reason the majority of the thesis focuses on a 2 stream decoupled architecture.
منابع مشابه
Simplifying Hardware for Out Of Order Execution using the Decoupling Paradigm
Future hardware and software technology will try to provide improved performance by extracting higher levels of parallelism. However the cost of a main memory access-in terms of missed instruction issue slots-increases with faster processors and greater issue widths. For this reason latency hiding technology remains one of the most important parts of high performance processor designs. In this ...
متن کاملA Comparison of Superscalar and Decoupled Access/Execute Architectures
This paper presents a comparison of superscalar and decoupled access/execute architectures. Both architectures attempt to exploit instruction-level parallelism by issuing multiple instructions per cycle, employing dynamic scheduling to maximize performance. Simulation results are presented for four different configurations, demonstrating that the architectural queues of the decoupled machines p...
متن کاملDesign and Effectiveness of Small-Sized Decoupled Dispatch Queues
Continuing demands for high degrees of Instruction Level Parallelism (ILP) require large dispatch queues in modern superscalar microprocessors. However, such large queues are inevitably accompanied by high circuit complexity which correspondingly limits the pipeline clock rates. This is due to the fact that most of today’s designs are based upon a centralized dispatch queue which depends on glo...
متن کاملCharacterization of embedded applications for decoupled processor architecture
Needs for performance on embedded applications will lead to the use of dynamic execution on embedded processors in the next few years. However, complete out-of-order superscalar cores are still expensive in terms of silicon area and power dissipation. In this paper, we study the adequation of a more limited form of dynamic execution, namely decoupled architecture, to embedded applications. Deco...
متن کاملDecoupled Architectures for Complexity-Effective General Purpose Processors
Decoupled architectures have previously been investigated in the context of high performance scientific computing. For general purpose computing, however, superscalar processors have proven to be flexible in providing high performance across a wide range of applications. To achieve this goal, these architectures have incorporated enormous amounts of complexity to obtain modest performance impro...
متن کامل